# Data and Code for: A Field Experiment on Antitrust Compliance

# README: Stata Pipeline for Cartel Screening and Empirical Analysis

**Authors**
Kei Kawai (UC Berkeley; University of Tokyo)
Jun Nakabayashi (Kyoto University)

**Corresponding author:** nakabayashi.jun.8x@kyoto-u.ac.jp

This project contains a Stata-based pipeline to conduct cartel screening, treatment assignment, and empirical analysis based on auction data.
Software requirements: Stata 18.5

## Software and Computational Requirements

- **Stata:** 18.5 SE.
- **Controlled randomness:** `set seed 10` is used for reproducibility.
- **User-written packages (bundled):**
  - `src/ado_rdrobust920/plus/` — primary toolchain (default)
  - `src/ado_rdrobust760/plus/` — legacy toolchain used **only in Step 1** (screening)
- **Ado path handling:** The master script sets the PLUS directory and the `.do` files switch to the 760 toolchain for Step 1, then restore 920 automatically. You do **not** need to change paths manually.
  ```stata
  sysdir set PLUS "src/ado_rdrobust920/plus"
  ```

## Reproducing with Docker (prebuilt image)

We provide a prebuilt Docker image for this package:

- **Image:** `nakabayashi1/replication3jan26:latest`
- **Base image:** `dataeditors/stata18_5-se-i:2025-02-26` (AEA Data Editor Stata 18.5 SE)

Pull and run (requires a valid Stata license file):
(The full set of commands for one author's environment is written out further below as an example.)

```bash
docker pull nakabayashi1/replication3jan26:latest

docker run -it --rm \
  -v /path/to/your/stata.lic:/usr/local/stata/stata.lic \
  nakabayashi1/replication3jan26:latest
```

Inside the container, run the pipeline with the following Stata commands:
```stata
cd src
do main.do
```

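### Result Retrieval
To retrieve the results, detach from the container with Ctrl+P followed by Ctrl+Q, then run the following on the host, replacing `/your/container/name` with the name of your container: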
```bash
docker cp \
  /your/container/name:project ./replication_results
```
All output files (including the Stata replication code) are then copied to the `replication_results` folder on your local machine. Do **not** `exit` the container before retrieving the results: the container was started with `--rm`, so it is removed as soon as it exits.

### Example (Windows 11 + WSL)

To pull the image and start the container:
```bash
sudo chmod 666 /var/run/docker.sock
docker pull nakabayashi1/replication3jan26
docker run -it --rm \
  -v /mnt/c/'Program Files'/Stata18/STATA.LIC:/usr/local/stata/stata.lic \
  nakabayashi1/replication3jan26:latest

# then inside the container:
cd src
do main.do
```

To retrieve the results, detach from the container with Ctrl+P followed by Ctrl+Q and run:
```bash
docker ps -a
docker cp determined_nobel:project ./replication_results
```
Here `determined_nobel` is the container name in this example; replace it with the name listed by `docker ps -a`.
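
Instead of copying the container name by hand, it can be looked up from the image name. The following is a minimal sketch, assuming exactly one running container was started from the image; `container_for_image` is a hypothetical helper and is not part of the replication package.

```bash
# Hypothetical helper (not part of the package): print the name of the first
# running container that was created from the given image.
container_for_image() {
  docker ps --filter "ancestor=$1" --format '{{.Names}}' | head -n 1
}

# Usage sketch, run on the host after detaching from the container:
#   name=$(container_for_image nakabayashi1/replication3jan26:latest)
#   docker cp "${name}:project" ./replication_results
```

If several containers from the same image are running, only the first one listed is returned, so checking `docker ps -a` by eye remains the safer route in that case.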


## Ado Dependencies and Versioning Strategy

This package bundles two ado repositories to ensure reproducibility:

- `src/ado_rdrobust920/plus/` — primary toolchain (default for the pipeline)
- `src/ado_rdrobust760/plus/` — legacy toolchain required **only for Step 1** (screening)

The `.do` files handle ado path switching internally:
- The pipeline runs under the **920** toolchain by default.
- When entering **Step 1**, the scripts temporarily switch to the **760** toolchain and then restore **920** for subsequent steps.

You do **not** need to modify ado paths manually. All scripts use relative paths and forward slashes (`/`).



## Data Availability and Provenance

- **OS tested:** Windows 11 (AMD Ryzen 7 6800U, 32 GB)
- **Memory:** ≥ 16 GB
- **Disk:** ≥ 10 GB free (MLIT files are large)


All raw data used in this project are **redistributable** and are **included** in the `data/` folder.

1. `mlitbid_090122a.dta` — MLIT auction bids FY2014–2020, with bidder license IDs and addresses.
   **Provenance:** Ministry of Land, Infrastructure, Transport and Tourism (MLIT), information disclosure requests (Jul. 2020 and Nov. 2021).
   **Rights:** Redistributable; included in this package.
   **Used in:** Fisher test; graphical analysis.

2. `biddata020119.dta` — MLIT auction bids FY2013–2017.
   **Provenance:** MLIT, information disclosure request (Jun. 2019).
   **Rights:** As above.
   **Used in:** Cartel screening (Step 1), summary statistics.

3. `rid_address_license_firmID_020119.dta` — subset with license and firm IDs, FY2013–2017.
   **Provenance/Rights/Access:** As in (2).

4. `sales_firm_license.xlsx` — Sales and engineer counts for 242 firms flagged as collusive.
   **Provenance:** Constructed by authors from public firm disclosures (http://www7.ciic.or.jp/).
   **Rights:** Redistributable; included in this package.
   **Date accessed:** 27 Sep 2022

5. `ring_list020119.dta` — List of 242 firms flagged as collusive by our screening.
   **Provenance:** Constructed by authors.
   **Rights:** Redistributable; included in this package.

6. `address_license_121818.csv` — Registry of all licensed construction firms (addresses as of 2018).
   **Provenance:** Public registry (https://etsuran2.mlit.go.jp/TAKKEN/).
   **Rights:** Redistributable; included in this package.
   **Date accessed:** 18 Dec 2018


---

## Implementation
**Total runtime:** ~9–15 hours on a modern laptop (Step 1 dominates); other steps are typically < 30 minutes each.

To execute the full pipeline from start to finish, run the main script:

```stata
do src/main.do
```

This script will:

- Run all Stata `.do` files in the correct order
- Construct the analysis samples
- Generate all tables and figures for both the main paper and the appendix

Make sure all dependencies (data files and the bundled ado folders) are present before running `main.do`.
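
Because the full run takes many hours, a quick pre-flight check can catch a missing input early. The sketch below is not part of the package: `missing_inputs` is a hypothetical helper, and it assumes the data files sit directly under `data/` and the ado folders and `main.do` under `src/`, relative to the project root, as the paths in this README suggest.

```bash
# Pre-flight sketch (hypothetical, not part of the package).
# Prints every expected input that is absent under the given project root.
missing_inputs() {
  local root="${1:-.}" path
  for path in \
    data/mlitbid_090122a.dta \
    data/biddata020119.dta \
    data/rid_address_license_firmID_020119.dta \
    data/sales_firm_license.xlsx \
    data/ring_list020119.dta \
    data/address_license_121818.csv \
    src/ado_rdrobust920/plus \
    src/ado_rdrobust760/plus \
    src/main.do
  do
    # -e is true for both regular files and directories
    [ -e "$root/$path" ] || echo "$path"
  done
}

# Usage sketch: an empty result means every listed input was found.
#   missing_inputs /path/to/project
```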


## Details of main.do
### Initial Setup

Start by running:

```stata
do modules/_init.do
```
- Creates folders: `tmp/`, `tables/`, and `graph/`
- Sets the Stata environment and ado directory
- Does **not** define global macros or paths

---

### Data Construction

#### Step 1: Firm-Level Screening

```stata
do Step1_screening.do
```

- Duration: 9–12 hours
- Outputs:
  - `firmleveltest/test_sample_with_results.dta`
  - `firmleveltest/graph_sp1x2/*.png` (1143 firm-level scatter plots, some of which are used as Figs. OA1–OA5)
  - `firmleveltest/graph_combined_gph/*.gph` (combined plots used in the online appendix)
- Note: Runs the cartel screening test for each firm
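
Since Step 1 runs for many hours, it can be useful to confirm that it finished before starting Step 2. The sketch below counts the scatter plots and compares the count with the 1143 stated above. `count_png` is a hypothetical helper, and the usage line assumes the command is run from the directory that contains `firmleveltest/`, which this README does not specify.

```bash
# Hypothetical helper (not part of the package): count the PNG files
# found under a directory.
count_png() {
  # tr strips the padding that some wc implementations add
  find "$1" -type f -name '*.png' | wc -l | tr -d ' '
}

# Usage sketch:
#   [ "$(count_png firmleveltest/graph_sp1x2)" -eq 1143 ] && echo "Step 1 plots complete"
```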

#### Step 2: Clustering

```stata
do Step2_clustering.do
```

- Duration: about 5 mins
- Output: `firmleveltest/ringfirms_plus.dta`
- Note: Classifies firms into 26 groups using a clustering algorithm

#### Step 3: Treatment Assignment

```stata
do Step3_treatment_assignment.do
```
- Duration: about 5 mins
- Output: `fishertest/group_treatment.dta`
- Note: Assigns treatment status to firm groups

#### Step 4: Data Construction (Part 1)

```stata
do Step4_data_construction1.do
```
- Duration: less than 5 mins
- Output: `tmp/temp_biddata_090122.dta`
- Uses: `mlitbid_090122a.dta` and `address_license_121818.csv`
- Note: Constructs firm-level dataset for main analysis

#### Step 5: Data Construction (Part 2)

```stata
do Step5_data_construction2.do
```
- Duration: about 10 mins
- Output: `tmp/sample_090122.dta`
- Note: Finalizes the dataset used in the main analysis

---

### Main Analysis

> Run only after Steps 1–5 are completed.

#### Summary Statistics

```stata
do modules/_summary_stat_table.do
do modules/_summary_stat_table_firm.do
```
- Duration: less than 5 mins
- Outputs:
  - `tables/summary_stats_auctions.tex` (Table 2)
  - `tables/summary_stats_firms.tex` (Table 3)

#### FDR Analysis

```stata
do modules/_fdr.do
```
- Duration: less than 5 mins
- Outputs:
  - `graph/fdr_noshrinkage_cdfplot_ftt_ts1s.png` (Fig. 3)
  - `graph/fdr_noshrinkage_cdfplot_ftt_ts1p.png` (Fig. OA5)

#### Time-Series Analysis

```stata
do modules/_time_series_plot.do
```
- Duration: about 5 mins
- Outputs:
  - `graph/twowaylpoly_abovereserve_timeseries.png` (Fig. 9)
  - `graph/twowaylpoly_pctlosingbids_timeseries.png` (Fig. 8 Left)
  - `graph/twowaylpoly_winbidpct_timeseries.png` (Fig. 8 Right)
  - `graph/twowaylpoly_q_winners_timeseries.png` (Fig. 10 Right)
  - `graph/twowaylpoly_q_losers_timeseries.png` (Fig. 10 Left)

#### RD Scatter Plots

```stata
do modules/_draw_main_scatter_plots.do
```

- Duration: about 5 mins
- Outputs:
  - `graph/Treatmentfirms_s_by31mar2021_condition_15mar2019.png` (Fig. 4 Top)
  - `graph/Controlfirms_s_by31mar2021_condition_15mar2019.png` (Fig. 4 Bottom)
  - `graph/Treatmentfirms_p_by31mar2021_condition_15mar2019.png` (Fig. 5 Top)
  - `graph/Controlfirms_p_by31mar2021_condition_15mar2019.png` (Fig. 5 Bottom)

#### Nippo Figures and Table

```stata
do modules/_nippo.do
```
- Duration: less than 5 mins
- Outputs:
  - `graph/NIPPO.png` (Fig. 1)
  - `graph/NIPPOp.png` (Fig. 2)
  - `tables/nippo.tex` (Table 1)

#### Fisher Randomization Test

```stata
do modules/_fisher_test.do
```

- Duration: 20–40 minutes
- Outputs:
  - `graph/fisher_scatter__auction_mean.png` (Fig. 7)
  - `graph/diffdiff_tau_score_tau_price_auction_mean.png` (Fig. 6)
  - `graph/diffdiff_pctbid_w_pctbid_l_bidder_mean.png` (Fig. 11 Left)
  - `graph/diffdiff_q_w_q_l_bidder_mean.png` (Fig. 11 Right)
  - `graph/diffdiff_abovereserve_l_bidder_mean.png` (Fig. 12)
- Notes:
  - The log reports, in a banner, the summary counts used in Fig. 7: the total number of points (764) and the number lying in the upper-left region relative to the 45° line (43).
  - The first replication package we submitted produced different results on AMD and Intel machines. In particular, an Intel Core i5-8250U machine failed to compute the Fisher test for groups 5, 16, 27, and 28 (test 1) and groups 3, 6, 8, 9, 11, and 19 (test 2). To eliminate these hardware-dependent differences, this replication package increases the number of retries in the `rdrobust` estimation. With more retries, Intel CPUs can compute the Fisher test for more groups, but more tests are also computed than in the published version, which was run on AMD machines. We therefore drop the results that were not originally computed on the AMD CPU. After this update, Intel-based and AMD-based machines give identical results except for group 23; because failures for group 23 persisted on Intel machines, we exclude that group. See also the comment `/*** reproducibility issue ***/` in `modules/__rd_for_fisher_test.do`.
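
One way to check the cross-hardware agreement is to retrieve the results from two machines and compare the generated tables. The following is a minimal sketch, assuming the two runs were copied into separate folders and that the table files are byte-identical when the results agree. `same_tables` is a hypothetical helper, and the folder names in the usage comment are placeholders.

```bash
# Hypothetical helper (not part of the package): succeed only if two result
# folders contain the same files with the same contents.
same_tables() {
  # diff -r walks both trees; -q only reports whether files differ.
  # Exit status is 0 when identical and non-zero when anything differs
  # or exists on one side only.
  diff -rq "$1" "$2" > /dev/null
}

# Usage sketch:
#   same_tables results_amd/tables results_intel/tables && echo "tables match"
```

Figures are left out of this comparison on purpose, since image files may embed metadata that differs between runs even when the plotted values agree.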

---

### Online Appendix Outputs

#### RD Binned Scatter

```stata
do modules/_binscatter_form_main_plots.do
```

- Duration: less than 5 minutes
- Outputs:
  - `graph/binscatter_Treatmentfirms_s_by.png` (OB.1 Top)
  - `graph/binscatter_Controlfirms_s_by.png` (OB.1 Bottom)
  - `graph/binscatter_Treatmentfirms_p_by.png` (OB.2 Top)
  - `graph/binscatter_Controlfirms_p_by.png` (OB.2 Bottom)

#### Event-Study Format Time Series

```stata
do modules/_time_series_event_study_form.do
```
- Duration: less than 5 minutes
- Outputs:
  - `graph/event_twowaylpoly_pctlosingbids_timeseries.png` (OB.3 Right)
  - `graph/event_twowaylpoly_winbidpct_timeseries.png` (OB.3 Left)
  - `graph/event_twowaylpoly_abovereserve_timeseries.png` (OB.4)
  - `graph/event_twowaylpoly_q_winners_timeseries.png` (OB.5 Left)
  - `graph/event_twowaylpoly_q_losers_timeseries.png` (OB.5 Right)

#### Distance Matrix (OA.1)

```stata
do modules/_distance_matrix_groupwise.do
distance_matrix_groupwise
```

- Duration: less than 5 minutes
- Requires: `firmleveltest/test_sample_with_results.dta`
- Outputs:
  - `firmleveltest/freq_matrix28_Hokkaido.tex` (Top)
  - `firmleveltest/freq_matrix28_Eastern_Japan.tex` (Middle)
  - `firmleveltest/freq_matrix28_Western_Japan.tex` (Bottom)

#### Classification Tables (End of OA)

```stata
do modules/_classification_table.do
```
- Duration: about 5 mins
- Requires: `firmleveltest/test_sample_with_results.dta`
- Outputs: `tables/classification[1-10].csv`
---
